Handling of Numeric Ranges for Graph-Based Knowledge Discovery
Abstract
Current graph-based knowledge discovery algorithms do not consider numeric attributes: they either discard them in the preprocessing step or treat them as alphanumeric values with an exact matching criterion. As a result, they are limited to domains that lack this type of attribute, or to finding patterns that contain no numeric attributes. In this work, we propose a new approach for handling numerical attributes in graph-based learning algorithms. Our approach shows how graph-based learning approaches increase their accuracy in the classification task, and their descriptive power, when they are able to use both nominal and numerical attributes. This new approach was tested with the Subdue system in the mutagenesis and PTC domains, showing an accuracy increase of around 16% compared to Subdue without our numerical attribute handling algorithm.
In research areas such as data mining and machine learning, the representation of the domain data is a fundamental aspect that largely determines the quality of the results of the discovery process. Depending on the domain, the data mining process analyzes a data collection (such as flat files, log files, or relational databases) to discover patterns, relationships, rules, associations, or useful exceptions to be used in decision-making processes and for the prediction of events and/or concept discovery. Graph-based algorithms have been used for years to describe, in a natural way, flat, sequential, and structural domains with acceptable results (Gonzalez, Holder, and Cook 2002; Ketkar, Holder, and Cook 2005). Some of these domains contain important numeric attributes (attributes with continuous values). Graph-based knowledge discovery systems do not manipulate such domains appropriately, although they can represent them appropriately. To the best of our knowledge, no graph-based knowledge discovery algorithm deals with continuous-valued attributes.
A solution proposed in the literature for this problem is the use of discretization techniques as a preprocessing or post-processing step, but not during the knowledge discovery phase. However, we think that these techniques do not use all the available knowledge that can be taken advantage of during the processing phase. Adding this capability to graph-based algorithms will improve their handling of numeric attributes, and in this way we will be able to improve both the classification accuracy and the descriptive power of the patterns. We will then be able to enhance our results for structural domains containing numerical attributes.
Copyright © 2010, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Handling of Numerical Ranges
In this section, we describe the numerical ranges generation algorithm (based on frequency histograms), which calculates the distance between values using any of seven measures. The measures are: a modification of the Tanimoto distance, a modification of the Euclidean distance, a modification of the Manhattan distance, a modification of the Correlation distance, a modification of the Canberra distance, and two new distance measures that we propose. Our algorithm can be seen in figure 1.
Figure 1: Numerical Ranges Generation Algorithm
The algorithm shown in figure 1 works as follows. The general function (GenerateRange) receives the data set and the number of examples in it. In the first step, we sort the values of the numerical attribute in ascending order (Sort function). Next, we create a frequency histogram of the ordered data. Then, we create an initial table of ranges with four fields: the center of the range, its frequency, and its low and high limits. In this initial ranges table, the center and the low and high limits all contain the same value, taken from the frequency histogram (GenerateHistogram function).
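The first steps described above (sort, build a frequency histogram, build the initial ranges table) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Range`, `generate_histogram`, and `initial_ranges` are hypothetical, and the later step that uses the seven distance measures is left out because this text does not describe it.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Range:
    """One row of the initial ranges table (hypothetical layout)."""
    center: float    # center of the range
    frequency: int   # how many examples carry this value
    low: float       # low limit of the range
    high: float      # high limit of the range


def generate_histogram(ordered: List[float]) -> List[Tuple[float, int]]:
    """Return (value, frequency) pairs in ascending order of value."""
    counts = Counter(ordered)
    return sorted(counts.items())


def initial_ranges(values: List[float]) -> List[Range]:
    """Sort the attribute values, build the histogram, and create one
    range per distinct value with center == low == high."""
    ordered = sorted(values)                 # ascending sort
    histogram = generate_histogram(ordered)  # frequency histogram
    return [Range(center=v, frequency=f, low=v, high=v)
            for v, f in histogram]


if __name__ == "__main__":
    for row in initial_ranges([3.0, 1.5, 3.0, 2.0]):
        print(row)
```

Each distinct value starts as its own degenerate range, which matches the statement that the center and both limits initially hold the same value.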
Proceedings of the Twenty-Third International Florida Artificial Intelligence Research Society Conference (FLAIRS 2010), p. 156
Similar resources
A new Approach for Handling Numeric Ranges for Graph-Based Knowledge Discovery
Discovering interesting patterns from structural domains is an important task in many real-world domains. In recent years, graph-based approaches have proven to be a straightforward tool for mining structural data. However, not all graph-based knowledge discovery algorithms deal with numerical attributes in the same way. Some of the algorithms discard the numeric attributes during the prepr...
Handling of Numeric Ranges with the Subdue System
Graph-based knowledge discovery has become a powerful tool in the machine learning and data mining areas. It provides a flexible and natural data representation to describe real world domains. In this research work we present a novel algorithm for graph-based approaches to deal with numerical attributes during the data processing phase implemented in the Subdue system. Our experimental results ...
Automatic Discovery of Technology Networks for Industrial-Scale R&D IT Projects via Data Mining
Industrial-scale R&D IT projects depend on many sub-technologies, which need to be understood and have their risks analysed before the project can begin if it is to succeed. When planning such an industrial-scale project, the list of technologies and the associations among them is often complex and forms a network. Discovery of this network of technologies is time consumi...
Concept Hierarchy-Based Pattern Discovery in Time Series Database: A Case Study on Financial Database
Data mining is the process of automatically searching large volumes of data for patterns, and it is a fairly recent topic in computing. Pattern discovery is now a field within the area of data mining. In general, financial databases contain large volumes of time series data, and these data contain patterns that are useful but not easy to find, and many financial ...
Document Clustering Based on Ontology and a Fuzzy Approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised classification of similar documents into different groups, is one of the important techniques of text mining. The most important step...
Publication date: 2010